There is a newer version of the record available.

Published February 18, 2022 | Version 0.1.0
Software Open

Synthsonic: Fast, Probabilistic modeling and Synthesis of Tabular Data

  • 1. ING Bank, University of Amsterdam
  • 2. ING Bank

Description

The creation of realistic, synthetic datasets has several purposes with growing demand in recent times, e.g. privacy protection and other cases where real data cannot be easily shared. A multitude of primarily neural networks (NNs), e.g. Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), or Bayesian Network (BN) approaches have been created to tackle this problem, however these require extensive compute resources, lack interpretability, and in some instances lack replication fidelity as well. We propose a hybrid, probabilistic approach for synthesizing pairwise independent tabular data, called Synthsonic. A sequence of well-understood, invertible statistical transformations removes first-order correlations, then a Bayesian Network jointly models continuous and categorical variables, and a calibrated discriminative learner captures the remaining dependencies. Replication studies on MIT's SDGym benchmark show marginally or significantly better performance than all prior BN-based approaches, while being competitive with NN-based approaches (first place in 10 out of 13 benchmark datasets). The computational time required to learn the data distribution is at least one order of magnitude lower than the NN methods. Furthermore, inspecting intermediate results during the synthetic data generation allows easy diagnostics and tailored corrections. We believe the combination of out-of-the-box performance, speed and interpretability make this method a significant addition to the synthetic data generation toolbox.

Files

synthsonic.zip

Files (966.7 kB)

Name Size Download all
md5:81fb5df643a2560d288179a0a7e86085
966.7 kB Preview Download

Additional details

References

  • See AIStats conference proceedings.